This is a dataset of an ecommerce company and we have some customers who are churning (leaving).
Perform exploratory analysis of the provided customer data to share insights of the behavior and characteristics of the customers. Make suggestions to help the company with customer retention
Build a predictive model to identify customers who are at risk of leaving the company (churn) based on the provided variables. This can help the company take proactive steps to retain these customers and reduce the rate of churn
CustomerID
Churn: Churn Flag
Tenure: in months
PreferredLoginDevice
CityTier
WarehouseToHome: Distance in between warehouse to home of customer
PreferredPaymentMode
Gender
HourSpendOnApp
NumberOfDeviceRegistered
PreferedOrderCat
SatisfactionScore
MaritalStatus
NumberOfAddress
OrderAmountHikeFromlastYear: Percentage increases in order from last year
CouponUsed: Total number of coupon has been used in last month
OrderCount: Total number of orders has been places in last month
DaySinceLastOrder
CashbackAmount: Average cashback in last month
Complain: Complain flag - if the customer ever had a complain
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_excel('E Commerce Dataset.xlsx')
df.head(5)
| CustomerID | Churn | Tenure | PreferredLoginDevice | CityTier | WarehouseToHome | PreferredPaymentMode | Gender | HourSpendOnApp | NumberOfDeviceRegistered | PreferedOrderCat | SatisfactionScore | MaritalStatus | NumberOfAddress | Complain | OrderAmountHikeFromlastYear | CouponUsed | OrderCount | DaySinceLastOrder | CashbackAmount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50001 | 1 | 4.0 | Mobile Phone | 3 | 6.0 | Debit Card | Female | 3.0 | 3 | Laptop & Accessory | 2 | Single | 9 | 1 | 11.0 | 1.0 | 1.0 | 5.0 | 159.93 |
| 1 | 50002 | 1 | NaN | Phone | 1 | 8.0 | UPI | Male | 3.0 | 4 | Mobile | 3 | Single | 7 | 1 | 15.0 | 0.0 | 1.0 | 0.0 | 120.90 |
| 2 | 50003 | 1 | NaN | Phone | 1 | 30.0 | Debit Card | Male | 2.0 | 4 | Mobile | 3 | Single | 6 | 1 | 14.0 | 0.0 | 1.0 | 3.0 | 120.28 |
| 3 | 50004 | 1 | 0.0 | Phone | 3 | 15.0 | Debit Card | Male | 2.0 | 4 | Laptop & Accessory | 5 | Single | 8 | 0 | 23.0 | 0.0 | 1.0 | 3.0 | 134.07 |
| 4 | 50005 | 1 | 0.0 | Phone | 1 | 12.0 | CC | Male | NaN | 3 | Mobile | 5 | Single | 3 | 0 | 11.0 | 1.0 | 1.0 | 3.0 | 129.60 |
df.shape
(5630, 20)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5630 entries, 0 to 5629 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CustomerID 5630 non-null int64 1 Churn 5630 non-null int64 2 Tenure 5366 non-null float64 3 PreferredLoginDevice 5630 non-null object 4 CityTier 5630 non-null int64 5 WarehouseToHome 5379 non-null float64 6 PreferredPaymentMode 5630 non-null object 7 Gender 5630 non-null object 8 HourSpendOnApp 5375 non-null float64 9 NumberOfDeviceRegistered 5630 non-null int64 10 PreferedOrderCat 5630 non-null object 11 SatisfactionScore 5630 non-null int64 12 MaritalStatus 5630 non-null object 13 NumberOfAddress 5630 non-null int64 14 Complain 5630 non-null int64 15 OrderAmountHikeFromlastYear 5365 non-null float64 16 CouponUsed 5374 non-null float64 17 OrderCount 5372 non-null float64 18 DaySinceLastOrder 5323 non-null float64 19 CashbackAmount 5630 non-null float64 dtypes: float64(8), int64(7), object(5) memory usage: 879.8+ KB
for columns in df:
print(columns,df[columns].nunique())
CustomerID 5630 Churn 2 Tenure 36 PreferredLoginDevice 3 CityTier 3 WarehouseToHome 34 PreferredPaymentMode 7 Gender 2 HourSpendOnApp 6 NumberOfDeviceRegistered 6 PreferedOrderCat 6 SatisfactionScore 5 MaritalStatus 3 NumberOfAddress 15 Complain 2 OrderAmountHikeFromlastYear 16 CouponUsed 17 OrderCount 16 DaySinceLastOrder 22 CashbackAmount 2586
Note: Churn = 1 means customer has churned
df[df['Churn'] == 1].count()
CustomerID 948 Churn 948 Tenure 867 PreferredLoginDevice 948 CityTier 948 WarehouseToHome 864 PreferredPaymentMode 948 Gender 948 HourSpendOnApp 890 NumberOfDeviceRegistered 948 PreferedOrderCat 948 SatisfactionScore 948 MaritalStatus 948 NumberOfAddress 948 Complain 948 OrderAmountHikeFromlastYear 934 CouponUsed 940 OrderCount 930 DaySinceLastOrder 894 CashbackAmount 948 dtype: int64
#check if there is any n/a
df['Churn'].isna().nunique()
1
#churn rate
df[df['Churn'] == 1].shape[0]/df.shape[0]*100
16.838365896980463
for col in df.columns:
print(col, df[col].isna().sum())
CustomerID 0 Churn 0 Tenure 264 PreferredLoginDevice 0 CityTier 0 WarehouseToHome 251 PreferredPaymentMode 0 Gender 0 HourSpendOnApp 255 NumberOfDeviceRegistered 0 PreferedOrderCat 0 SatisfactionScore 0 MaritalStatus 0 NumberOfAddress 0 Complain 0 OrderAmountHikeFromlastYear 265 CouponUsed 256 OrderCount 258 DaySinceLastOrder 307 CashbackAmount 0
Here we will understand select variables
Show histograms for all numeric columns. Describe the each variable's distribution briefly to a business stakeholder
numerical_df = df[['CustomerID','Churn','Tenure','CityTier','WarehouseToHome','HourSpendOnApp','NumberOfDeviceRegistered','SatisfactionScore','NumberOfAddress','Complain', 'OrderAmountHikeFromlastYear', 'CouponUsed', 'OrderCount', 'DaySinceLastOrder', 'CashbackAmount']]
numerical_df.head(5)
| CustomerID | Churn | Tenure | CityTier | WarehouseToHome | HourSpendOnApp | NumberOfDeviceRegistered | SatisfactionScore | NumberOfAddress | Complain | OrderAmountHikeFromlastYear | CouponUsed | OrderCount | DaySinceLastOrder | CashbackAmount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50001 | 1 | 4.0 | 3 | 6.0 | 3.0 | 3 | 2 | 9 | 1 | 11.0 | 1.0 | 1.0 | 5.0 | 159.93 |
| 1 | 50002 | 1 | NaN | 1 | 8.0 | 3.0 | 4 | 3 | 7 | 1 | 15.0 | 0.0 | 1.0 | 0.0 | 120.90 |
| 2 | 50003 | 1 | NaN | 1 | 30.0 | 2.0 | 4 | 3 | 6 | 1 | 14.0 | 0.0 | 1.0 | 3.0 | 120.28 |
| 3 | 50004 | 1 | 0.0 | 3 | 15.0 | 2.0 | 4 | 5 | 8 | 0 | 23.0 | 0.0 | 1.0 | 3.0 | 134.07 |
| 4 | 50005 | 1 | 0.0 | 1 | 12.0 | NaN | 3 | 5 | 3 | 0 | 11.0 | 1.0 | 1.0 | 3.0 | 129.60 |
categorical_df = df[['PreferredLoginDevice','PreferredPaymentMode','Gender','PreferedOrderCat','MaritalStatus']]
categorical_df.head(5)
| PreferredLoginDevice | PreferredPaymentMode | Gender | PreferedOrderCat | MaritalStatus | |
|---|---|---|---|---|---|
| 0 | Mobile Phone | Debit Card | Female | Laptop & Accessory | Single |
| 1 | Phone | UPI | Male | Mobile | Single |
| 2 | Phone | Debit Card | Male | Mobile | Single |
| 3 | Phone | Debit Card | Male | Laptop & Accessory | Single |
| 4 | Phone | CC | Male | Mobile | Single |
sns.histplot(data=numerical_df, x="CityTier")
<AxesSubplot:xlabel='CityTier', ylabel='Count'>
sns.pairplot(numerical_df,hue="Churn")
<seaborn.axisgrid.PairGrid at 0x7fd73b7e36a0>
Cleaning (ignore)
#As mobile phone and phone are both same so we have merged them
df.loc[df['PreferredLoginDevice'] == 'Phone', 'PreferredLoginDevice' ] = 'Mobile Phone'
df.loc[df['PreferedOrderCat'] == 'Mobile', 'PreferedOrderCat' ] = 'Mobile Phone'
#as cod is also cash on delievery
#as cc is also credit card so i merged them
df.loc[df['PreferredPaymentMode'] == 'COD', 'PreferredPaymentMode' ] = 'Cash on Delivery' # uses loc function
df.loc[df['PreferredPaymentMode'] == 'CC', 'PreferredPaymentMode' ] = 'Credit Card'
Describe these results briefly
for col in df.columns:
if df[col].dtype == object:
print(df[col].unique(), df[col].value_counts())
['Mobile Phone' 'Computer'] Mobile Phone 3996 Computer 1634 Name: PreferredLoginDevice, dtype: int64 ['Debit Card' 'UPI' 'Credit Card' 'Cash on Delivery' 'E wallet'] Debit Card 2314 Credit Card 1774 E wallet 614 Cash on Delivery 514 UPI 414 Name: PreferredPaymentMode, dtype: int64 ['Female' 'Male'] Male 3384 Female 2246 Name: Gender, dtype: int64 ['Laptop & Accessory' 'Mobile Phone' 'Others' 'Fashion' 'Grocery'] Mobile Phone 2080 Laptop & Accessory 2050 Fashion 826 Grocery 410 Others 264 Name: PreferedOrderCat, dtype: int64 ['Single' 'Divorced' 'Married'] Married 2986 Single 1796 Divorced 848 Name: MaritalStatus, dtype: int64
Provide business recommendation for each of the below
df['Churn'].corr(df['Complain'])
0.25018825469703104
churn_rate_complain = df[df['Complain'] == 1]['Churn'].mean()
churn_rate_no_complain = df[df['Complain'] == 0]['Churn'].mean()
print("Churn rate for Complain =", churn_rate_complain)
print("Churn rate for No Complain =",churn_rate_no_complain)
Churn rate for Complain = 0.3167082294264339 Churn rate for No Complain = 0.1092896174863388
Visualize the correlation between all variables
#Onc hot coding for categorical dataset
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
new_df = enc.fit_transform(categorical_df)
df_ohc = pd.DataFrame(new_df.toarray(), columns=enc.get_feature_names_out(), dtype=int)
df_ohc.head(5)
| PreferredLoginDevice_Computer | PreferredLoginDevice_Mobile Phone | PreferredLoginDevice_Phone | PreferredPaymentMode_CC | PreferredPaymentMode_COD | PreferredPaymentMode_Cash on Delivery | PreferredPaymentMode_Credit Card | PreferredPaymentMode_Debit Card | PreferredPaymentMode_E wallet | PreferredPaymentMode_UPI | ... | Gender_Male | PreferedOrderCat_Fashion | PreferedOrderCat_Grocery | PreferedOrderCat_Laptop & Accessory | PreferedOrderCat_Mobile | PreferedOrderCat_Mobile Phone | PreferedOrderCat_Others | MaritalStatus_Divorced | MaritalStatus_Married | MaritalStatus_Single | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
5 rows × 21 columns
df_final = pd.concat([numerical_df, df_ohc], axis=1)
df_final.head(5)
| CustomerID | Churn | Tenure | CityTier | WarehouseToHome | HourSpendOnApp | NumberOfDeviceRegistered | SatisfactionScore | NumberOfAddress | Complain | ... | Gender_Male | PreferedOrderCat_Fashion | PreferedOrderCat_Grocery | PreferedOrderCat_Laptop & Accessory | PreferedOrderCat_Mobile | PreferedOrderCat_Mobile Phone | PreferedOrderCat_Others | MaritalStatus_Divorced | MaritalStatus_Married | MaritalStatus_Single | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50001 | 1 | 4.0 | 3 | 6.0 | 3.0 | 3 | 2 | 9 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 50002 | 1 | NaN | 1 | 8.0 | 3.0 | 4 | 3 | 7 | 1 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | 50003 | 1 | NaN | 1 | 30.0 | 2.0 | 4 | 3 | 6 | 1 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | 50004 | 1 | 0.0 | 3 | 15.0 | 2.0 | 4 | 5 | 8 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 50005 | 1 | 0.0 | 1 | 12.0 | NaN | 3 | 5 | 3 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
5 rows × 36 columns
df_final = df_final.dropna()
df_final.head(5)
| CustomerID | Churn | Tenure | CityTier | WarehouseToHome | HourSpendOnApp | NumberOfDeviceRegistered | SatisfactionScore | NumberOfAddress | Complain | ... | Gender_Male | PreferedOrderCat_Fashion | PreferedOrderCat_Grocery | PreferedOrderCat_Laptop & Accessory | PreferedOrderCat_Mobile | PreferedOrderCat_Mobile Phone | PreferedOrderCat_Others | MaritalStatus_Divorced | MaritalStatus_Married | MaritalStatus_Single | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50001 | 1 | 4.0 | 3 | 6.0 | 3.0 | 3 | 2 | 9 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 50004 | 1 | 0.0 | 3 | 15.0 | 2.0 | 4 | 5 | 8 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 5 | 50006 | 1 | 0.0 | 1 | 22.0 | 3.0 | 5 | 5 | 2 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 11 | 50012 | 1 | 11.0 | 1 | 6.0 | 3.0 | 4 | 3 | 10 | 1 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 12 | 50013 | 1 | 0.0 | 1 | 11.0 | 2.0 | 3 | 3 | 2 | 1 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
5 rows × 36 columns
Discuss a few significant correlations
#CustomerID column as it's not relevant for correlation
df_final = df_final.drop(columns=['CustomerID'])
df_final.head(5)
| Churn | Tenure | CityTier | WarehouseToHome | HourSpendOnApp | NumberOfDeviceRegistered | SatisfactionScore | NumberOfAddress | Complain | OrderAmountHikeFromlastYear | ... | Gender_Male | PreferedOrderCat_Fashion | PreferedOrderCat_Grocery | PreferedOrderCat_Laptop & Accessory | PreferedOrderCat_Mobile | PreferedOrderCat_Mobile Phone | PreferedOrderCat_Others | MaritalStatus_Divorced | MaritalStatus_Married | MaritalStatus_Single | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 4.0 | 3 | 6.0 | 3.0 | 3 | 2 | 9 | 1 | 11.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 1 | 0.0 | 3 | 15.0 | 2.0 | 4 | 5 | 8 | 0 | 23.0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 5 | 1 | 0.0 | 1 | 22.0 | 3.0 | 5 | 5 | 2 | 1 | 22.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 11 | 1 | 11.0 | 1 | 6.0 | 3.0 | 4 | 3 | 10 | 1 | 13.0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 12 | 1 | 0.0 | 1 | 11.0 | 2.0 | 3 | 3 | 2 | 1 | 13.0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
5 rows × 35 columns
corr_matrix = df_final.corr()
corr_matrix
| Churn | Tenure | CityTier | WarehouseToHome | HourSpendOnApp | NumberOfDeviceRegistered | SatisfactionScore | NumberOfAddress | Complain | OrderAmountHikeFromlastYear | ... | Gender_Male | PreferedOrderCat_Fashion | PreferedOrderCat_Grocery | PreferedOrderCat_Laptop & Accessory | PreferedOrderCat_Mobile | PreferedOrderCat_Mobile Phone | PreferedOrderCat_Others | MaritalStatus_Divorced | MaritalStatus_Married | MaritalStatus_Single | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Churn | 1.000000 | -0.340013 | 0.073858 | 0.087318 | 0.060845 | 0.149041 | 0.095759 | 0.076336 | 0.238137 | 0.017193 | ... | 0.033792 | 0.013086 | -0.017880 | -0.184584 | 0.016675 | 0.181683 | 0.010208 | -0.013023 | -0.159808 | 0.179481 |
| Tenure | -0.340013 | 1.000000 | -0.057414 | -0.016353 | -0.029818 | -0.058752 | -0.009972 | 0.196547 | -0.035228 | 0.012197 | ... | -0.052123 | 0.130259 | 0.091251 | 0.104482 | -0.033466 | -0.211477 | 0.103702 | 0.015788 | 0.085990 | -0.103150 |
| CityTier | 0.073858 | -0.057414 | 1.000000 | 0.004457 | -0.070035 | -0.008616 | -0.019494 | -0.046406 | -0.006122 | -0.027628 | ... | -0.018072 | 0.083035 | -0.001750 | 0.245824 | -0.097493 | -0.281079 | -0.011248 | 0.019181 | -0.029876 | 0.017370 |
| WarehouseToHome | 0.087318 | -0.016353 | 0.004457 | 1.000000 | 0.052731 | 0.024582 | 0.000434 | 0.003422 | 0.003829 | 0.031975 | ... | 0.035581 | -0.009435 | 0.003512 | 0.043231 | -0.015441 | -0.028939 | -0.035530 | -0.013650 | 0.038336 | -0.030496 |
| HourSpendOnApp | 0.060845 | -0.029818 | -0.070035 | 0.052731 | 1.000000 | 0.293021 | 0.039879 | 0.124962 | 0.020413 | 0.096827 | ... | -0.009952 | -0.147830 | -0.017382 | -0.130819 | -0.112896 | 0.281309 | 0.023109 | 0.018024 | 0.022257 | -0.037134 |
| NumberOfDeviceRegistered | 0.149041 | -0.058752 | -0.008616 | 0.024582 | 0.293021 | 1.000000 | -0.017788 | 0.066684 | 0.018881 | 0.083342 | ... | -0.026706 | -0.091604 | 0.022548 | -0.080608 | -0.125343 | 0.188563 | 0.035364 | -0.001099 | -0.064517 | 0.069345 |
| SatisfactionScore | 0.095759 | -0.009972 | -0.019494 | 0.000434 | 0.039879 | -0.017788 | 1.000000 | 0.054627 | -0.044533 | -0.008143 | ... | -0.032398 | 0.016587 | 0.027097 | -0.025448 | -0.006238 | 0.011297 | 0.030401 | 0.051528 | -0.012167 | -0.025659 |
| NumberOfAddress | 0.076336 | 0.196547 | -0.046406 | 0.003422 | 0.124962 | 0.066684 | 0.054627 | 1.000000 | -0.017363 | 0.010275 | ... | -0.030980 | -0.040136 | -0.005968 | 0.059515 | -0.031229 | -0.023102 | -0.004349 | -0.024802 | -0.001463 | 0.020125 |
| Complain | 0.238137 | -0.035228 | -0.006122 | 0.003829 | 0.020413 | 0.018881 | -0.044533 | -0.017363 | 1.000000 | 0.003260 | ... | -0.039522 | -0.000022 | 0.004534 | -0.018126 | -0.005327 | 0.019793 | 0.007866 | 0.012777 | -0.000364 | -0.009180 |
| OrderAmountHikeFromlastYear | 0.017193 | 0.012197 | -0.027628 | 0.031975 | 0.096827 | 0.083342 | -0.008143 | 0.010275 | 0.003260 | 1.000000 | ... | 0.010081 | -0.051335 | 0.041476 | -0.024964 | -0.067089 | 0.080301 | 0.021090 | -0.014517 | 0.014894 | -0.004949 |
| CouponUsed | 0.010982 | 0.075959 | 0.013313 | -0.014311 | 0.160239 | 0.114066 | 0.006687 | -0.017851 | 0.005472 | 0.048959 | ... | -0.041493 | 0.078762 | 0.078182 | -0.038409 | -0.088426 | 0.001006 | 0.082800 | 0.043312 | 0.010976 | -0.044087 |
| OrderCount | -0.001962 | 0.112339 | 0.035018 | 0.008800 | 0.092601 | 0.069559 | -0.004622 | -0.079413 | -0.001917 | 0.042950 | ... | -0.035418 | 0.154599 | 0.126821 | -0.018610 | -0.082487 | -0.083855 | 0.118690 | 0.022122 | 0.043676 | -0.062952 |
| DaySinceLastOrder | -0.139254 | 0.118932 | 0.014304 | 0.010646 | 0.019526 | -0.049817 | 0.012950 | -0.148784 | -0.056315 | 0.017716 | ... | -0.018478 | 0.157977 | 0.035204 | 0.122257 | -0.082597 | -0.221627 | 0.071340 | -0.007111 | 0.034002 | -0.030789 |
| CashbackAmount | -0.058756 | 0.213852 | 0.146569 | -0.012433 | 0.131281 | 0.113504 | 0.012061 | 0.096752 | 0.010389 | 0.014215 | ... | -0.020263 | 0.455466 | 0.014736 | 0.101787 | -0.232654 | -0.401303 | 0.443994 | 0.013801 | 0.034234 | -0.046693 |
| PreferredLoginDevice_Computer | 0.040890 | -0.010308 | -0.000184 | 0.022101 | -0.004911 | 0.025628 | -0.035021 | -0.024398 | -0.014878 | -0.000220 | ... | 0.025482 | -0.054930 | 0.032595 | 0.056688 | -0.036701 | -0.012669 | 0.005916 | -0.016560 | -0.006364 | 0.019158 |
| PreferredLoginDevice_Mobile Phone | -0.079120 | 0.085754 | 0.098949 | -0.002403 | 0.106765 | 0.042948 | -0.007731 | 0.078405 | 0.012571 | 0.004871 | ... | -0.037886 | 0.188988 | -0.014342 | 0.093415 | -0.185187 | -0.162324 | 0.021285 | 0.005117 | 0.030007 | -0.035702 |
| PreferredLoginDevice_Phone | 0.053023 | -0.096773 | -0.125199 | -0.022497 | -0.129642 | -0.084051 | 0.050272 | -0.071176 | 0.001261 | -0.005919 | ... | 0.018568 | -0.176046 | -0.019492 | -0.183909 | 0.277125 | 0.220374 | -0.033815 | 0.012652 | -0.030677 | 0.023109 |
| PreferredPaymentMode_CC | 0.030729 | -0.032474 | -0.073109 | 0.007388 | -0.054905 | -0.036118 | 0.013975 | 0.018801 | 0.019183 | -0.028536 | ... | 0.010948 | -0.035283 | -0.003861 | -0.089557 | 0.504562 | -0.067153 | -0.006698 | 0.015132 | -0.018716 | 0.008548 |
| PreferredPaymentMode_COD | 0.059430 | -0.027497 | -0.049111 | -0.004197 | -0.004520 | 0.002875 | 0.011974 | -0.009988 | -0.017256 | -0.059288 | ... | 0.027718 | -0.064820 | -0.011748 | -0.053643 | 0.053221 | 0.085898 | -0.020380 | 0.006594 | -0.029529 | 0.026426 |
| PreferredPaymentMode_Cash on Delivery | -0.006498 | 0.069208 | -0.004977 | -0.007623 | 0.016056 | -0.011988 | 0.005600 | -0.004053 | -0.013373 | 0.000700 | ... | -0.009100 | 0.193691 | 0.054822 | -0.065981 | -0.020480 | -0.068682 | 0.060785 | 0.007005 | 0.017955 | -0.024315 |
| PreferredPaymentMode_Credit Card | -0.029392 | -0.024243 | -0.150637 | 0.036141 | 0.033820 | 0.000782 | 0.043184 | 0.003998 | 0.003623 | 0.036106 | ... | -0.028839 | -0.039491 | 0.032185 | -0.033678 | -0.094304 | 0.098421 | -0.019855 | -0.001500 | 0.003140 | -0.002212 |
| PreferredPaymentMode_Debit Card | -0.039232 | 0.040958 | -0.127528 | -0.041475 | 0.020115 | 0.007461 | -0.073269 | 0.013272 | -0.002413 | 0.010242 | ... | 0.004968 | -0.004245 | -0.019560 | -0.001249 | 0.026243 | -0.008097 | 0.028682 | -0.010594 | -0.011571 | 0.020221 |
| PreferredPaymentMode_E wallet | 0.043972 | 0.016709 | 0.503165 | 0.033578 | -0.039522 | -0.008889 | 0.026658 | -0.027925 | 0.001807 | -0.036139 | ... | -0.022827 | 0.074168 | -0.014552 | 0.141394 | -0.065803 | -0.175791 | -0.001349 | 0.020559 | 0.017060 | -0.033513 |
| PreferredPaymentMode_UPI | 0.000938 | -0.043942 | -0.036088 | -0.021470 | -0.030052 | 0.010839 | 0.009321 | 0.007556 | 0.012421 | 0.033331 | ... | 0.039949 | -0.016992 | -0.011405 | 0.003837 | -0.034355 | 0.024280 | -0.019785 | -0.017968 | 0.024753 | -0.012836 |
| Gender_Female | -0.033792 | 0.052123 | 0.018072 | -0.035581 | 0.009952 | 0.026706 | 0.032398 | 0.030980 | 0.039522 | -0.010081 | ... | -1.000000 | 0.009374 | -0.005292 | 0.043367 | 0.001884 | -0.053910 | 0.006532 | 0.012541 | -0.035047 | 0.027833 |
| Gender_Male | 0.033792 | -0.052123 | -0.018072 | 0.035581 | -0.009952 | -0.026706 | -0.032398 | -0.030980 | -0.039522 | 0.010081 | ... | 1.000000 | -0.009374 | 0.005292 | -0.043367 | -0.001884 | 0.053910 | -0.006532 | -0.012541 | 0.035047 | -0.027833 |
| PreferedOrderCat_Fashion | 0.013086 | 0.130259 | 0.083035 | -0.009435 | -0.147830 | -0.091604 | 0.016587 | -0.040136 | -0.000022 | -0.051335 | ... | -0.009374 | 1.000000 | -0.014552 | -0.379275 | -0.065803 | -0.253118 | -0.025246 | 0.004190 | 0.010466 | -0.014253 |
| PreferedOrderCat_Grocery | -0.017880 | 0.091251 | -0.001750 | 0.003512 | -0.017382 | 0.022548 | 0.027097 | -0.005968 | 0.004534 | 0.041476 | ... | 0.005292 | -0.014552 | 1.000000 | -0.041501 | -0.007200 | -0.027697 | -0.002762 | -0.016429 | 0.024625 | -0.013853 |
| PreferedOrderCat_Laptop & Accessory | -0.184584 | 0.104482 | 0.245824 | 0.043231 | -0.130819 | -0.080608 | -0.025448 | 0.059515 | -0.018126 | -0.024964 | ... | -0.043367 | -0.379275 | -0.041501 | 1.000000 | -0.187659 | -0.721851 | -0.071997 | -0.000340 | 0.057496 | -0.060812 |
| PreferedOrderCat_Mobile | 0.016675 | -0.033466 | -0.097493 | -0.015441 | -0.112896 | -0.125343 | -0.006238 | -0.031229 | -0.005327 | -0.067089 | ... | -0.001884 | -0.065803 | -0.007200 | -0.187659 | 1.000000 | -0.125238 | -0.012491 | -0.009681 | -0.028832 | 0.037872 |
| PreferedOrderCat_Mobile Phone | 0.181683 | -0.211477 | -0.281079 | -0.028939 | 0.281309 | 0.188563 | 0.011297 | -0.023102 | 0.019793 | 0.080301 | ... | 0.053910 | -0.253118 | -0.027697 | -0.721851 | -0.125238 | 1.000000 | -0.048049 | 0.001864 | -0.062740 | 0.065241 |
| PreferedOrderCat_Others | 0.010208 | 0.103702 | -0.011248 | -0.035530 | 0.023109 | 0.035364 | 0.030401 | -0.004349 | 0.007866 | 0.021090 | ... | -0.006532 | -0.025246 | -0.002762 | -0.071997 | -0.012491 | -0.048049 | 1.000000 | 0.004273 | 0.019615 | -0.024032 |
| MaritalStatus_Divorced | -0.013023 | 0.015788 | 0.019181 | -0.013650 | 0.018024 | -0.001099 | 0.051528 | -0.024802 | 0.012777 | -0.014517 | ... | -0.012541 | 0.004190 | -0.016429 | -0.000340 | -0.009681 | 0.001864 | 0.004273 | 1.000000 | -0.432989 | -0.288871 |
| MaritalStatus_Married | -0.159808 | 0.085990 | -0.029876 | 0.038336 | 0.022257 | -0.064517 | -0.012167 | -0.001463 | -0.000364 | 0.014894 | ... | 0.035047 | 0.010466 | 0.024625 | 0.057496 | -0.028832 | -0.062740 | 0.019615 | -0.432989 | 1.000000 | -0.737892 |
| MaritalStatus_Single | 0.179481 | -0.103150 | 0.017370 | -0.030496 | -0.037134 | 0.069345 | -0.025659 | 0.020125 | -0.009180 | -0.004949 | ... | -0.027833 | -0.014253 | -0.013853 | -0.060812 | 0.037872 | 0.065241 | -0.024032 | -0.288871 | -0.737892 | 1.000000 |
35 rows × 35 columns
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
figure(figsize=(10, 8), dpi=150)
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.8)
plt.show()
Sort the correlation in descending order
corr_matrix.sort_values(by = 'Churn', ascending=False)
| Churn | Tenure | CityTier | WarehouseToHome | HourSpendOnApp | NumberOfDeviceRegistered | SatisfactionScore | NumberOfAddress | Complain | OrderAmountHikeFromlastYear | ... | Gender_Male | PreferedOrderCat_Fashion | PreferedOrderCat_Grocery | PreferedOrderCat_Laptop & Accessory | PreferedOrderCat_Mobile | PreferedOrderCat_Mobile Phone | PreferedOrderCat_Others | MaritalStatus_Divorced | MaritalStatus_Married | MaritalStatus_Single | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Churn | 1.000000 | -0.340013 | 0.073858 | 0.087318 | 0.060845 | 0.149041 | 0.095759 | 0.076336 | 0.238137 | 0.017193 | ... | 0.033792 | 0.013086 | -0.017880 | -0.184584 | 0.016675 | 0.181683 | 0.010208 | -0.013023 | -0.159808 | 0.179481 |
| Complain | 0.238137 | -0.035228 | -0.006122 | 0.003829 | 0.020413 | 0.018881 | -0.044533 | -0.017363 | 1.000000 | 0.003260 | ... | -0.039522 | -0.000022 | 0.004534 | -0.018126 | -0.005327 | 0.019793 | 0.007866 | 0.012777 | -0.000364 | -0.009180 |
| PreferedOrderCat_Mobile Phone | 0.181683 | -0.211477 | -0.281079 | -0.028939 | 0.281309 | 0.188563 | 0.011297 | -0.023102 | 0.019793 | 0.080301 | ... | 0.053910 | -0.253118 | -0.027697 | -0.721851 | -0.125238 | 1.000000 | -0.048049 | 0.001864 | -0.062740 | 0.065241 |
| MaritalStatus_Single | 0.179481 | -0.103150 | 0.017370 | -0.030496 | -0.037134 | 0.069345 | -0.025659 | 0.020125 | -0.009180 | -0.004949 | ... | -0.027833 | -0.014253 | -0.013853 | -0.060812 | 0.037872 | 0.065241 | -0.024032 | -0.288871 | -0.737892 | 1.000000 |
| NumberOfDeviceRegistered | 0.149041 | -0.058752 | -0.008616 | 0.024582 | 0.293021 | 1.000000 | -0.017788 | 0.066684 | 0.018881 | 0.083342 | ... | -0.026706 | -0.091604 | 0.022548 | -0.080608 | -0.125343 | 0.188563 | 0.035364 | -0.001099 | -0.064517 | 0.069345 |
| SatisfactionScore | 0.095759 | -0.009972 | -0.019494 | 0.000434 | 0.039879 | -0.017788 | 1.000000 | 0.054627 | -0.044533 | -0.008143 | ... | -0.032398 | 0.016587 | 0.027097 | -0.025448 | -0.006238 | 0.011297 | 0.030401 | 0.051528 | -0.012167 | -0.025659 |
| WarehouseToHome | 0.087318 | -0.016353 | 0.004457 | 1.000000 | 0.052731 | 0.024582 | 0.000434 | 0.003422 | 0.003829 | 0.031975 | ... | 0.035581 | -0.009435 | 0.003512 | 0.043231 | -0.015441 | -0.028939 | -0.035530 | -0.013650 | 0.038336 | -0.030496 |
| NumberOfAddress | 0.076336 | 0.196547 | -0.046406 | 0.003422 | 0.124962 | 0.066684 | 0.054627 | 1.000000 | -0.017363 | 0.010275 | ... | -0.030980 | -0.040136 | -0.005968 | 0.059515 | -0.031229 | -0.023102 | -0.004349 | -0.024802 | -0.001463 | 0.020125 |
| CityTier | 0.073858 | -0.057414 | 1.000000 | 0.004457 | -0.070035 | -0.008616 | -0.019494 | -0.046406 | -0.006122 | -0.027628 | ... | -0.018072 | 0.083035 | -0.001750 | 0.245824 | -0.097493 | -0.281079 | -0.011248 | 0.019181 | -0.029876 | 0.017370 |
| HourSpendOnApp | 0.060845 | -0.029818 | -0.070035 | 0.052731 | 1.000000 | 0.293021 | 0.039879 | 0.124962 | 0.020413 | 0.096827 | ... | -0.009952 | -0.147830 | -0.017382 | -0.130819 | -0.112896 | 0.281309 | 0.023109 | 0.018024 | 0.022257 | -0.037134 |
| PreferredPaymentMode_COD | 0.059430 | -0.027497 | -0.049111 | -0.004197 | -0.004520 | 0.002875 | 0.011974 | -0.009988 | -0.017256 | -0.059288 | ... | 0.027718 | -0.064820 | -0.011748 | -0.053643 | 0.053221 | 0.085898 | -0.020380 | 0.006594 | -0.029529 | 0.026426 |
| PreferredLoginDevice_Phone | 0.053023 | -0.096773 | -0.125199 | -0.022497 | -0.129642 | -0.084051 | 0.050272 | -0.071176 | 0.001261 | -0.005919 | ... | 0.018568 | -0.176046 | -0.019492 | -0.183909 | 0.277125 | 0.220374 | -0.033815 | 0.012652 | -0.030677 | 0.023109 |
| PreferredPaymentMode_E wallet | 0.043972 | 0.016709 | 0.503165 | 0.033578 | -0.039522 | -0.008889 | 0.026658 | -0.027925 | 0.001807 | -0.036139 | ... | -0.022827 | 0.074168 | -0.014552 | 0.141394 | -0.065803 | -0.175791 | -0.001349 | 0.020559 | 0.017060 | -0.033513 |
| PreferredLoginDevice_Computer | 0.040890 | -0.010308 | -0.000184 | 0.022101 | -0.004911 | 0.025628 | -0.035021 | -0.024398 | -0.014878 | -0.000220 | ... | 0.025482 | -0.054930 | 0.032595 | 0.056688 | -0.036701 | -0.012669 | 0.005916 | -0.016560 | -0.006364 | 0.019158 |
| Gender_Male | 0.033792 | -0.052123 | -0.018072 | 0.035581 | -0.009952 | -0.026706 | -0.032398 | -0.030980 | -0.039522 | 0.010081 | ... | 1.000000 | -0.009374 | 0.005292 | -0.043367 | -0.001884 | 0.053910 | -0.006532 | -0.012541 | 0.035047 | -0.027833 |
| PreferredPaymentMode_CC | 0.030729 | -0.032474 | -0.073109 | 0.007388 | -0.054905 | -0.036118 | 0.013975 | 0.018801 | 0.019183 | -0.028536 | ... | 0.010948 | -0.035283 | -0.003861 | -0.089557 | 0.504562 | -0.067153 | -0.006698 | 0.015132 | -0.018716 | 0.008548 |
| OrderAmountHikeFromlastYear | 0.017193 | 0.012197 | -0.027628 | 0.031975 | 0.096827 | 0.083342 | -0.008143 | 0.010275 | 0.003260 | 1.000000 | ... | 0.010081 | -0.051335 | 0.041476 | -0.024964 | -0.067089 | 0.080301 | 0.021090 | -0.014517 | 0.014894 | -0.004949 |
| PreferedOrderCat_Mobile | 0.016675 | -0.033466 | -0.097493 | -0.015441 | -0.112896 | -0.125343 | -0.006238 | -0.031229 | -0.005327 | -0.067089 | ... | -0.001884 | -0.065803 | -0.007200 | -0.187659 | 1.000000 | -0.125238 | -0.012491 | -0.009681 | -0.028832 | 0.037872 |
| PreferedOrderCat_Fashion | 0.013086 | 0.130259 | 0.083035 | -0.009435 | -0.147830 | -0.091604 | 0.016587 | -0.040136 | -0.000022 | -0.051335 | ... | -0.009374 | 1.000000 | -0.014552 | -0.379275 | -0.065803 | -0.253118 | -0.025246 | 0.004190 | 0.010466 | -0.014253 |
| CouponUsed | 0.010982 | 0.075959 | 0.013313 | -0.014311 | 0.160239 | 0.114066 | 0.006687 | -0.017851 | 0.005472 | 0.048959 | ... | -0.041493 | 0.078762 | 0.078182 | -0.038409 | -0.088426 | 0.001006 | 0.082800 | 0.043312 | 0.010976 | -0.044087 |
| PreferedOrderCat_Others | 0.010208 | 0.103702 | -0.011248 | -0.035530 | 0.023109 | 0.035364 | 0.030401 | -0.004349 | 0.007866 | 0.021090 | ... | -0.006532 | -0.025246 | -0.002762 | -0.071997 | -0.012491 | -0.048049 | 1.000000 | 0.004273 | 0.019615 | -0.024032 |
| PreferredPaymentMode_UPI | 0.000938 | -0.043942 | -0.036088 | -0.021470 | -0.030052 | 0.010839 | 0.009321 | 0.007556 | 0.012421 | 0.033331 | ... | 0.039949 | -0.016992 | -0.011405 | 0.003837 | -0.034355 | 0.024280 | -0.019785 | -0.017968 | 0.024753 | -0.012836 |
| OrderCount | -0.001962 | 0.112339 | 0.035018 | 0.008800 | 0.092601 | 0.069559 | -0.004622 | -0.079413 | -0.001917 | 0.042950 | ... | -0.035418 | 0.154599 | 0.126821 | -0.018610 | -0.082487 | -0.083855 | 0.118690 | 0.022122 | 0.043676 | -0.062952 |
| PreferredPaymentMode_Cash on Delivery | -0.006498 | 0.069208 | -0.004977 | -0.007623 | 0.016056 | -0.011988 | 0.005600 | -0.004053 | -0.013373 | 0.000700 | ... | -0.009100 | 0.193691 | 0.054822 | -0.065981 | -0.020480 | -0.068682 | 0.060785 | 0.007005 | 0.017955 | -0.024315 |
| MaritalStatus_Divorced | -0.013023 | 0.015788 | 0.019181 | -0.013650 | 0.018024 | -0.001099 | 0.051528 | -0.024802 | 0.012777 | -0.014517 | ... | -0.012541 | 0.004190 | -0.016429 | -0.000340 | -0.009681 | 0.001864 | 0.004273 | 1.000000 | -0.432989 | -0.288871 |
| PreferedOrderCat_Grocery | -0.017880 | 0.091251 | -0.001750 | 0.003512 | -0.017382 | 0.022548 | 0.027097 | -0.005968 | 0.004534 | 0.041476 | ... | 0.005292 | -0.014552 | 1.000000 | -0.041501 | -0.007200 | -0.027697 | -0.002762 | -0.016429 | 0.024625 | -0.013853 |
| PreferredPaymentMode_Credit Card | -0.029392 | -0.024243 | -0.150637 | 0.036141 | 0.033820 | 0.000782 | 0.043184 | 0.003998 | 0.003623 | 0.036106 | ... | -0.028839 | -0.039491 | 0.032185 | -0.033678 | -0.094304 | 0.098421 | -0.019855 | -0.001500 | 0.003140 | -0.002212 |
| Gender_Female | -0.033792 | 0.052123 | 0.018072 | -0.035581 | 0.009952 | 0.026706 | 0.032398 | 0.030980 | 0.039522 | -0.010081 | ... | -1.000000 | 0.009374 | -0.005292 | 0.043367 | 0.001884 | -0.053910 | 0.006532 | 0.012541 | -0.035047 | 0.027833 |
| PreferredPaymentMode_Debit Card | -0.039232 | 0.040958 | -0.127528 | -0.041475 | 0.020115 | 0.007461 | -0.073269 | 0.013272 | -0.002413 | 0.010242 | ... | 0.004968 | -0.004245 | -0.019560 | -0.001249 | 0.026243 | -0.008097 | 0.028682 | -0.010594 | -0.011571 | 0.020221 |
| CashbackAmount | -0.058756 | 0.213852 | 0.146569 | -0.012433 | 0.131281 | 0.113504 | 0.012061 | 0.096752 | 0.010389 | 0.014215 | ... | -0.020263 | 0.455466 | 0.014736 | 0.101787 | -0.232654 | -0.401303 | 0.443994 | 0.013801 | 0.034234 | -0.046693 |
| PreferredLoginDevice_Mobile Phone | -0.079120 | 0.085754 | 0.098949 | -0.002403 | 0.106765 | 0.042948 | -0.007731 | 0.078405 | 0.012571 | 0.004871 | ... | -0.037886 | 0.188988 | -0.014342 | 0.093415 | -0.185187 | -0.162324 | 0.021285 | 0.005117 | 0.030007 | -0.035702 |
| DaySinceLastOrder | -0.139254 | 0.118932 | 0.014304 | 0.010646 | 0.019526 | -0.049817 | 0.012950 | -0.148784 | -0.056315 | 0.017716 | ... | -0.018478 | 0.157977 | 0.035204 | 0.122257 | -0.082597 | -0.221627 | 0.071340 | -0.007111 | 0.034002 | -0.030789 |
| MaritalStatus_Married | -0.159808 | 0.085990 | -0.029876 | 0.038336 | 0.022257 | -0.064517 | -0.012167 | -0.001463 | -0.000364 | 0.014894 | ... | 0.035047 | 0.010466 | 0.024625 | 0.057496 | -0.028832 | -0.062740 | 0.019615 | -0.432989 | 1.000000 | -0.737892 |
| PreferedOrderCat_Laptop & Accessory | -0.184584 | 0.104482 | 0.245824 | 0.043231 | -0.130819 | -0.080608 | -0.025448 | 0.059515 | -0.018126 | -0.024964 | ... | -0.043367 | -0.379275 | -0.041501 | 1.000000 | -0.187659 | -0.721851 | -0.071997 | -0.000340 | 0.057496 | -0.060812 |
| Tenure | -0.340013 | 1.000000 | -0.057414 | -0.016353 | -0.029818 | -0.058752 | -0.009972 | 0.196547 | -0.035228 | 0.012197 | ... | -0.052123 | 0.130259 | 0.091251 | 0.104482 | -0.033466 | -0.211477 | 0.103702 | 0.015788 | 0.085990 | -0.103150 |
35 rows × 35 columns
Fill nulls in each column
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2)
for col in df.columns:
nullCount = df[col].isnull().sum()
if nullCount > 0:
df[col]=imputer.fit_transform(df[[col]])
print("Filled", nullCount, "nulls in", col)
Filled 264 nulls in Tenure Filled 251 nulls in WarehouseToHome Filled 255 nulls in HourSpendOnApp Filled 265 nulls in OrderAmountHikeFromlastYear Filled 256 nulls in CouponUsed Filled 258 nulls in OrderCount Filled 307 nulls in DaySinceLastOrder
df_final.head(5)
| Churn | Tenure | CityTier | WarehouseToHome | HourSpendOnApp | NumberOfDeviceRegistered | SatisfactionScore | NumberOfAddress | Complain | OrderAmountHikeFromlastYear | ... | Gender_Male | PreferedOrderCat_Fashion | PreferedOrderCat_Grocery | PreferedOrderCat_Laptop & Accessory | PreferedOrderCat_Mobile | PreferedOrderCat_Mobile Phone | PreferedOrderCat_Others | MaritalStatus_Divorced | MaritalStatus_Married | MaritalStatus_Single | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 4.0 | 3 | 6.0 | 3.0 | 3 | 2 | 9 | 1 | 11.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 1 | 0.0 | 3 | 15.0 | 2.0 | 4 | 5 | 8 | 0 | 23.0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 5 | 1 | 0.0 | 1 | 22.0 | 3.0 | 5 | 5 | 2 | 1 | 22.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 11 | 1 | 11.0 | 1 | 6.0 | 3.0 | 4 | 3 | 10 | 1 | 13.0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 12 | 1 | 0.0 | 1 | 11.0 | 2.0 | 3 | 3 | 2 | 1 | 13.0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
5 rows × 35 columns
from sklearn.linear_model import LogisticRegression
x = df_final.drop(['Churn'],axis = 1)
x.head(5)
y = df_final['Churn']
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
#check the shape of X_train and X_test
x_train.shape, x_test.shape
((3019, 34), (755, 34))
clf = LogisticRegression(random_state=0).fit(x_train,y_train)
/Users/jacquie/opt/anaconda3/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
clf.fit(x_train, y_train)
/Users/jacquie/opt/anaconda3/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
LogisticRegression(random_state=0)
#predict result
y_pred = clf.predict(x_test)
Train and show train & test accuracy
from sklearn.metrics import roc_curve, roc_auc_score, classification_report, accuracy_score, confusion_matrix
train_accuracy = accuracy_score(y_pred, y_test)*100
train_accuracy
89.27152317880794
y_pred_train = clf.predict(x_train)
y_pred_train
array([0, 0, 1, ..., 0, 1, 0])
print('Training-set accuracy score: {0:0.4f}'. format(accuracy_score(y_train, y_pred_train)))
Training-set accuracy score: 0.8917
Feel free to use any libraries / stackoverflow
print(classification_report(y_pred, y_test, digits=6))
precision recall f1-score support
0 0.958333 0.920000 0.938776 675
1 0.495327 0.662500 0.566845 80
accuracy 0.892715 755
macro avg 0.726830 0.791250 0.752810 755
weighted avg 0.909273 0.892715 0.899366 755
print(classification_report(y_train, y_pred_train, digits=6))
Describe the performance of model and steps you could take to improve it
print('Confusion matrix:\n', confusion_matrix(y_pred, y_test))
Confusion matrix: [[621 54] [ 27 53]]
print('Confusion matrix:\n', confusion_matrix(y_pred_train, y_train))
Confusion matrix: [[2404 236] [ 91 288]]
#check for overfitting and underfitting
print('Training set score: {:.4f}'.format(clf.score(x_train, y_train)))
Training set score: 0.8917
print('Test set score: {:.4f}'.format(clf.score(x_test, y_test)))
Test set score: 0.8927
The training-set accuracy score is 0.8917 while the test-set accuracy to be 0.8927. These two values are quite comparable. So, there is no question of overfitting.
Next Step, we could perform hyperparameter optimization using gridsearch CV to improve the performance for this particular model